Robust Speaker Recognition Using Speech Enhancement And Attention Model
In this paper, a novel architecture for speaker recognition is proposed by
cascading speech enhancement and speaker processing. Its aim is to improve
speaker recognition performance when speech signals are corrupted by noise.
Instead of individually processing speech enhancement and speaker recognition,
the two modules are integrated into one framework by a joint optimisation using
deep neural networks. Furthermore, to increase robustness against noise, a
multi-stage attention mechanism is employed to highlight the speaker-related
features learned from context information in the time and frequency domains. To
evaluate the speaker identification and verification performance of the
proposed approach, we test it on VoxCeleb1, one of the most widely used
benchmark datasets. Moreover, the robustness of the proposed approach is also
tested on VoxCeleb1 data corrupted by three types of interference: general
noise, music, and babble, at different signal-to-noise ratio (SNR) levels. The
results show that the proposed approach using speech enhancement and
multi-stage attention models outperforms two strong baselines that do not use them in
most acoustic conditions in our experiments.
Comment: Accepted by Odyssey 2020
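As a rough illustration of the cascaded design, the PyTorch sketch below feeds
a masking-based enhancement network into an attention-pooled speaker
classifier and trains both jointly with a weighted sum of the two losses. The
module sizes, the single attention stage, and the loss weight `alpha` are
illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class EnhancementNet(nn.Module):
    """Predicts a (0, 1) time-frequency mask applied to the noisy input."""
    def __init__(self, freq_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(freq_dim, hidden, batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, freq_dim), nn.Sigmoid())

    def forward(self, noisy):                      # (batch, time, freq)
        h, _ = self.rnn(noisy)
        return noisy * self.mask(h)                # enhanced spectrogram

class AttentiveSpeakerNet(nn.Module):
    """Attention-pooled speaker classifier over enhanced features."""
    def __init__(self, freq_dim=64, hidden=128, n_speakers=1251):
        super().__init__()
        self.rnn = nn.GRU(freq_dim, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)            # attention over time frames
        self.cls = nn.Linear(hidden, n_speakers)

    def forward(self, feats):                      # (batch, time, freq)
        h, _ = self.rnn(feats)                     # (batch, time, hidden)
        w = torch.softmax(self.att(h), dim=1)      # frame weights
        emb = (w * h).sum(dim=1)                   # utterance embedding
        return self.cls(emb)

enh, spk = EnhancementNet(), AttentiveSpeakerNet()
opt = torch.optim.Adam(list(enh.parameters()) + list(spk.parameters()), lr=1e-3)

noisy, clean = torch.randn(4, 200, 64), torch.randn(4, 200, 64)
labels = torch.randint(0, 1251, (4,))

enhanced = enh(noisy)
alpha = 0.5                                        # illustrative loss weight
loss = nn.functional.cross_entropy(spk(enhanced), labels) \
       + alpha * nn.functional.mse_loss(enhanced, clean)  # joint optimisation
opt.zero_grad(); loss.backward(); opt.step()
```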
Speaker Re-identification with Speaker Dependent Speech Enhancement
While the use of deep neural networks has significantly boosted speaker
recognition performance, it is still challenging to separate speakers in poor
acoustic environments. Here, speech enhancement methods have traditionally
been used to improve performance. Recent work has shown that adapting speech
enhancement can lead to further gains. This paper introduces a novel approach
that cascades speech enhancement and speaker recognition. In the first step, a
speaker embedding vector is generated, which is used in the second step to
enhance the speech quality and re-identify the speakers. The models are
trained in an integrated framework with joint optimisation. The proposed
approach is evaluated on the VoxCeleb1 dataset, which aims to assess speaker
recognition in real-world situations. In addition, three types of noise at
different signal-to-noise ratios were added for this work. The results show
that the proposed approach using speaker-dependent speech enhancement can
yield better speaker recognition and speech enhancement performance than two
baselines in various noise conditions.
Comment: Accepted for presentation at Interspeech 2020
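The speaker-dependent step can be pictured as conditioning the enhancement
network on an embedding from a first recognition pass. The sketch below uses
frame-wise concatenation as the fusion mechanism; the dimensions and the
fusion choice are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEnhancer(nn.Module):
    """Enhancement network conditioned on a target-speaker embedding."""
    def __init__(self, freq_dim=64, emb_dim=128, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(freq_dim + emb_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, freq_dim), nn.Sigmoid())

    def forward(self, noisy, spk_emb):             # (B, T, F), (B, E)
        cond = spk_emb.unsqueeze(1).expand(-1, noisy.size(1), -1)
        h, _ = self.rnn(torch.cat([noisy, cond], dim=-1))
        return noisy * self.mask(h)                # speaker-targeted enhancement

noisy = torch.randn(4, 200, 64)                    # noisy spectrogram features
spk_emb = torch.randn(4, 128)                      # embedding from a first pass
enhanced = SpeakerConditionedEnhancer()(noisy, spk_emb)
print(enhanced.shape)                              # torch.Size([4, 200, 64])
```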
Contextual Joint Factor Acoustic Embeddings
Embedding acoustic information into fixed-length representations is of
interest for a whole range of applications in speech and audio technology. Two
novel unsupervised approaches to generating acoustic embeddings by modelling
acoustic context are proposed. The first approach is a contextual joint factor
synthesis encoder, where the encoder in an encoder/decoder framework is trained
to extract joint factors from surrounding audio frames to best generate the
target output. The second approach is a contextual joint factor analysis
encoder, where the encoder is trained to analyse joint factors from the source
signal that correlate best with the neighbouring audio. To evaluate the
effectiveness of our approaches compared to prior work, two tasks are conducted
-- phone classification and speaker recognition -- and tested on different TIMIT
data sets. Experimental results show that one of the proposed approaches
outperforms phone classification baselines, yielding a classification accuracy
of 74.1%. When using additional out-of-domain data for training, a further
3% improvement can be obtained for both the phone classification and speaker
recognition tasks.
Comment: Published at SLT 2021
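The two objectives can be contrasted in a few lines. In the hypothetical
sketch below, the synthesis encoder maps the neighbouring frames to an
embedding from which a decoder generates the centre frame, while the analysis
encoder embeds the centre frame so that a decoder can predict its neighbours;
the window size, feature dimension, and embedding size are all illustrative.

```python
import torch
import torch.nn as nn

feat_dim, emb_dim = 64, 32                         # illustrative sizes

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_out))

# Contextual joint factor *synthesis*: encode the surrounding frames and
# decode the centre (target) frame from the extracted joint factors.
syn_enc, syn_dec = mlp(2 * feat_dim, emb_dim), mlp(emb_dim, feat_dim)

# Contextual joint factor *analysis*: encode the centre frame so that its
# embedding best predicts (correlates with) the neighbouring frames.
ana_enc, ana_dec = mlp(feat_dim, emb_dim), mlp(emb_dim, 2 * feat_dim)

frames = torch.randn(16, 3, feat_dim)              # (batch, [left, centre, right], feat)
left, centre, right = frames[:, 0], frames[:, 1], frames[:, 2]
context = torch.cat([left, right], dim=-1)

syn_loss = nn.functional.mse_loss(syn_dec(syn_enc(context)), centre)
ana_loss = nn.functional.mse_loss(ana_dec(ana_enc(centre)), context)
(syn_loss + ana_loss).backward()                   # in practice trained separately
```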
Supervised Speaker Embedding De-Mixing in Two-Speaker Environment
Separating different speaker properties from a multi-speaker environment is
challenging. Instead of separating a two-speaker signal in signal space like
speech source separation, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker
signal in embedding space. The proposed approach contains two steps. In step
one, the clean speaker embeddings are learned and collected by a residual
TDNN-based network. In step two, the two-speaker signal and the embedding of one of
the speakers are both input to a speaker embedding de-mixing network. The
de-mixing network is trained to generate the embedding of the other speaker by
reconstruction loss. Speaker identification accuracy and the cosine similarity
score between the clean embeddings and the de-mixed embeddings are used to
evaluate the quality of the obtained embeddings. Experiments are conducted on
two kinds of data: artificially augmented two-speaker data (TIMIT) and
real-world recordings of two-speaker data (MC-WSJ). Six different speaker
embedding de-mixing architectures are investigated. Compared with the
performance on the clean speaker embeddings, the results show that one of the
proposed architectures achieves close performance, reaching 96.9% identification
accuracy and a cosine similarity of 0.89.
Comment: Published at SLT 2021
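A minimal sketch of the de-mixing step, assuming pre-computed embeddings: the
network receives the two-speaker embedding together with one speaker's clean
embedding and is trained with a reconstruction loss to output the other
speaker's embedding. The simple concatenation architecture below is one
plausible instance, not necessarily any of the six designs investigated.

```python
import torch
import torch.nn as nn

emb_dim = 256                                      # illustrative embedding size

demix = nn.Sequential(                             # de-mixing network
    nn.Linear(2 * emb_dim, 512), nn.ReLU(),
    nn.Linear(512, emb_dim),
)

mix_emb = torch.randn(8, emb_dim)                  # embedding of two-speaker signal
emb_a = torch.randn(8, emb_dim)                    # clean embedding, speaker A (given)
emb_b = torch.randn(8, emb_dim)                    # clean embedding, speaker B (target)

pred_b = demix(torch.cat([mix_emb, emb_a], dim=-1))
loss = nn.functional.mse_loss(pred_b, emb_b)       # reconstruction loss
loss.backward()

# Evaluation metric from the paper: cosine similarity between de-mixed
# and clean embeddings.
cos = nn.functional.cosine_similarity(pred_b, emb_b, dim=-1)
```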
Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks
In speaker recognition, deep neural networks deliver state-of-the-art performance due
to their large capacities and powerful feature extraction abilities. However, this performance can be highly affected by interference from background noise and other speakers.
This thesis focuses on new neural network architectures that are designed to overcome
such interference and thereby improve the robustness of the speaker recognition system.
In order to improve the noise robustness of the speaker recognition model, two
novel network architectures are proposed. The first is the hierarchical attention network, which is able to capture both local and global features in order to improve the
robustness of the network. The experimental results show it can deliver results that
are comparable to the published state-of-the-art methods, reaching 4.28% equal error
rate using the VoxCeleb1 training and test sets. The second approach is a joint
speech enhancement and speaker recognition system that consists of two
networks: the first integrates speech enhancement and speaker recognition into
one framework to better filter out noise, while the second feeds speaker
embeddings into the speech enhancement network, providing it with prior
knowledge that improves its performance. The results show that a joint system
with a speaker-dependent speech enhancement model can deliver results that
are comparable to the published state-of-the-art methods, reaching 4.15% equal
error rate using the VoxCeleb1 training and test sets.
In order to overcome interfering speakers, two novel approaches are proposed. The
first is referred to as the embedding de-mixing approach that separates the speaker and content properties from a two-speaker signal in an embedding space, rather than
in a signal space. The results show that the de-mixed embeddings are close to the
clean embeddings in terms of quality, and the back-end speaker recognition model can
make use of the de-mixed embeddings to reach 96.9% speaker identification accuracy,
compared to those achieved using clean embeddings (98.5%) on the TIMIT dataset.
The second approach is the first end-to-end weakly supervised speaker
identification approach, based on a novel hierarchical transformer network
architecture. The results show that the proposed model can capture speaker
properties from two speakers in one input utterance. The hierarchical
transformer network achieves more than 3% relative improvement over the
baselines in all test conditions.
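For the weakly supervised setting, a hierarchical transformer can be sketched
as a frame-level encoder over fixed-length segments followed by a
segment-level encoder, scored with a multi-label objective since only
utterance-level speaker labels are available. The layer counts, mean pooling,
and all sizes below are assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class HierarchicalTransformer(nn.Module):
    def __init__(self, feat=64, d_model=128, n_speakers=1251, seg_len=50):
        super().__init__()
        self.seg_len = seg_len
        self.proj = nn.Linear(feat, d_model)
        frame_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.frame_enc = nn.TransformerEncoder(frame_layer, num_layers=2)  # local
        seg_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.seg_enc = nn.TransformerEncoder(seg_layer, num_layers=2)      # global
        self.cls = nn.Linear(d_model, n_speakers)

    def forward(self, x):                          # (B, T, feat)
        B, T, _ = x.shape
        S = T // self.seg_len                      # number of segments
        x = self.proj(x[:, :S * self.seg_len])
        segs = x.reshape(B * S, self.seg_len, -1)
        segs = self.frame_enc(segs).mean(dim=1)    # one vector per segment
        utt = self.seg_enc(segs.reshape(B, S, -1)).mean(dim=1)
        return self.cls(utt)                       # multi-label speaker logits

model = HierarchicalTransformer()
x = torch.randn(2, 400, 64)
targets = torch.zeros(2, 1251)
targets[0, [3, 17]] = 1.0                          # two speakers present, weak label
loss = nn.functional.binary_cross_entropy_with_logits(model(x), targets)
loss.backward()
```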
Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification
Identifying multiple speakers without knowing where a speaker's voice is in a
recording is a challenging task. In this paper, a hierarchical attention
network is proposed to solve a weakly labelled speaker identification problem.
The use of a hierarchical structure, consisting of a frame-level encoder and a
segment-level encoder, aims to learn speaker-related information locally and
globally. Speech streams are segmented into fragments. The frame-level encoder
with attention learns features and highlights the target-related frames
locally, and outputs a fragment-based embedding. The segment-level encoder
works with a second attention layer to emphasise the fragments most likely
related to the target speakers. The global information is finally collected
from the segment-level module to predict speakers via a classifier. To
evaluate the effectiveness of the proposed approach, artificial datasets based
on Switchboard Cellular Part 1 (SWBC) and VoxCeleb1 are constructed under two
conditions, in which speakers' voices are either overlapped or not overlapped.
Compared to two baselines, the results show that the proposed approach
achieves better performance.
Moreover, further experiments are conducted to evaluate the impact of utterance
segmentation. The results show that a reasonable segmentation can slightly
improve identification performance.
Comment: Accepted for presentation at Interspeech 2020
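A minimal sketch of the two-level attention described above: softmax attention
pools the frames within each fragment, and a second attention pools the
fragment vectors into an utterance representation for the classifier. The GRU
encoders and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Weighted average with learned softmax attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                          # (B, N, D)
        w = torch.softmax(self.score(h), dim=1)
        return (w * h).sum(dim=1)                  # (B, D)

class HierarchicalAttentionNet(nn.Module):
    def __init__(self, feat=64, hidden=128, n_speakers=1251, frag_len=50):
        super().__init__()
        self.frag_len = frag_len
        self.frame_enc = nn.GRU(feat, hidden, batch_first=True)
        self.frame_pool = AttnPool(hidden)         # highlights target frames
        self.seg_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.seg_pool = AttnPool(hidden)           # emphasises target fragments
        self.cls = nn.Linear(hidden, n_speakers)

    def forward(self, x):                          # (B, T, feat)
        B, T, F = x.shape
        S = T // self.frag_len                     # number of fragments
        frags = x[:, :S * self.frag_len].reshape(B * S, self.frag_len, F)
        h, _ = self.frame_enc(frags)
        frag_emb = self.frame_pool(h).reshape(B, S, -1)
        h2, _ = self.seg_enc(frag_emb)
        return self.cls(self.seg_pool(h2))         # speaker logits

logits = HierarchicalAttentionNet()(torch.randn(2, 300, 64))
print(logits.shape)                                # torch.Size([2, 1251])
```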
Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization
Many-to-many voice conversion with non-parallel training data has seen
significant progress in recent years, and StarGAN-based models have attracted
particular interest for voice conversion. However, most StarGAN-based methods
have focused on voice conversion experiments in situations where the number of
speakers is small and the amount of training data is large. In this work, we
aim to improve the data efficiency of the model and achieve many-to-many
non-parallel StarGAN-based voice conversion for a relatively large number of
speakers with limited training samples. To improve data efficiency,
the proposed model uses a speaker encoder for extracting speaker embeddings and
conducts adaptive instance normalization (AdaIN) on convolutional weights.
Experiments are conducted with 109 speakers under two low-resource situations,
where the number of training samples is 20 and 5 per speaker. An objective
evaluation shows the proposed model is better than the baseline methods.
Furthermore, a subjective evaluation shows that, for both naturalness and
similarity, the proposed model outperforms the baseline method.
Comment: Accepted by ICASSP 2021
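One plausible reading of weight adaptive instance normalization is sketched
below: the speaker embedding predicts a per-filter scale and bias that
modulate an instance-normalised convolution kernel, so conditioning happens
through the weights rather than the activations. This is an illustrative
simplification, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightAdaINConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, k, emb_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k) * 0.02)
        self.affine = nn.Linear(emb_dim, 2 * out_ch)   # predicts scale and bias

    def forward(self, x, spk_emb):                 # (B, C_in, T), (B, E)
        outs = []
        for b in range(x.size(0)):                 # per-sample modulated kernels
            scale, bias = self.affine(spk_emb[b]).chunk(2)
            w = self.weight
            mu = w.mean(dim=(1, 2), keepdim=True)
            sd = w.std(dim=(1, 2), keepdim=True) + 1e-5
            w = (w - mu) / sd                      # instance-normalise each filter
            w = w * (1 + scale)[:, None, None] + bias[:, None, None]
            outs.append(F.conv1d(x[b:b + 1], w, padding=w.size(-1) // 2))
        return torch.cat(outs, dim=0)

conv = WeightAdaINConv1d(in_ch=80, out_ch=64, k=5, emb_dim=128)
y = conv(torch.randn(4, 80, 100), torch.randn(4, 128))  # embeddings from an encoder
print(y.shape)                                     # torch.Size([4, 64, 100])
```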
H-VECTORS: Utterance-level Speaker Embedding Using A Hierarchical Attention Model
In this paper, a hierarchical attention network to generate utterance-level
embeddings (H-vectors) for speaker identification is proposed. Since different
parts of an utterance may have different contributions to speaker identities,
the use of a hierarchical structure aims to learn speaker-related information
locally and globally. In the proposed approach, a frame-level encoder and
attention are applied to segments of an input utterance, generating individual
segment vectors. Then, segment-level attention is applied to the segment
vectors to construct an utterance representation. To evaluate the effectiveness
of the proposed approach, the NIST SRE 2008 Part 1 dataset is used for
training, and two datasets, Switchboard Cellular Part 1 and CallHome American
English Speech, are used to evaluate the quality of the extracted utterance
embeddings on speaker identification and verification tasks. In comparison
with two baselines, X-vector and X-vector+Attention, the results show that
H-vectors achieve significantly better performance. Furthermore, the extracted
utterance-level embeddings are more discriminative than the two baselines when
mapped into a 2D space using t-SNE.
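The two evaluation uses mentioned above can be sketched briefly: cosine
scoring of trial pairs for verification, and a 2D t-SNE projection for
visualising how discriminative the embeddings are. The random vectors below
stand in for extracted H-vectors; scikit-learn's TSNE is assumed available.

```python
import torch
from sklearn.manifold import TSNE

# Stand-in utterance embeddings (e.g. H-vectors) for 100 utterances.
emb = torch.nn.functional.normalize(torch.randn(100, 256), dim=-1)

# Verification: cosine score between an enrolment/test pair of embeddings.
score = (emb[0] @ emb[1]).item()                   # accept if above a tuned threshold

# Visualisation: project the embeddings into a 2D space with t-SNE.
points = TSNE(n_components=2, perplexity=15).fit_transform(emb.numpy())
print(points.shape)                                # (100, 2)
```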